By: Michael D’Acampora

The goal of this assignment is to give you practice working with accuracy and other recommender system metrics. In this assignment you’re asked to do at least one or (if you like) both of the following:

  •Work in a small group, and/or
  •Choose a different dataset to work with from your previous projects

  1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.

  2. Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.

  3. Compare and report on any change in accuracy before and after you’ve made the change in #2.

  4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation was possible. Also, briefly propose how you would design a reasonable online evaluation environment.

library(recommenderlab)
library(tidyverse)

Data import and analysis

data("Jester5k")
dim(Jester5k)
## [1] 5000  100
typeof(Jester5k)
## [1] "S4"

All of the objects in recommenderlab are created under the S4 object-oriented system, which calls for a different approach to how the dataset is manipulated and how models are used.
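For instance, S4 objects expose slots rather than list elements, so the usual `$` idioms for lists and data frames don’t apply; a quick illustration:

```r
library(recommenderlab)
data("Jester5k")

# S4 objects are inspected through their class and slots,
# not with $ as for lists and data frames
isS4(Jester5k)        # TRUE
slotNames(Jester5k)   # names of the object's slots
class(Jester5k@data)  # the underlying sparse rating matrix
```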

# Summary of ratings data
jester_df <- as(Jester5k, 'data.frame')
summary(jester_df$rating) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -9.95   -3.06    1.46    0.85    5.10    9.90

The ratings are on a scale of -10 to +10. Since the mean rating is 0.85 and the median 1.46, the typical joke lands only slightly above neutral, which is certainly not good by joke standards.

# Total number of ratings
nratings(Jester5k)
## [1] 362106
# Number of ratings per user
summary(rowCounts(Jester5k))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.00   53.00   72.00   72.42  100.00  100.00
# Density: proportion of cells that have a rating
nratings(Jester5k) / (dim(Jester5k)[1] * dim(Jester5k)[2])
## [1] 0.724212
jester_matrix <- spread(jester_df, key = "item", value = "rating")
jester_matrix

Using tidyr and the spread function we can take a quick snapshot of what the matrix looks like. However, since the source is a realRatingMatrix S4 object, it takes more tinkering to get it to do what you’re normally used to doing.

jokes_rm <- Jester5k[rowCounts(Jester5k) > 50]

min(rowCounts(jokes_rm))
## [1] 51
hist(getRatings(jokes_rm), breaks=100)

Plotting a histogram shows that the negative values occur with similar frequencies, while the positive ratings are more frequent but taper off toward the maximum rating of 10.

Let’s take a look at the same distribution after normalization.

One by row centering…

hist(getRatings(normalize(jokes_rm, method="center")), breaks = 100)

And the other by Z-score.

hist(getRatings(normalize(jokes_rm, method="Z-score")), breaks = 100)

We can see the peak of the normalized ratings in this reduced set sits between about 0 and 1.

Lastly we can take a quick look at how many jokes each user rated and the average rating per joke, by taking the row count and column mean, respectively.

hist(rowCounts(jokes_rm), breaks = 50)

hist(colMeans(jokes_rm), breaks = 20) 

Teehee

As a quick aside, we can find the joke with the highest average rating, the “funniest” joke. Here it is…

funniest <- which.max(colMeans(Jester5k))
cat(JesterJokes[funniest])
## A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."

Evaluation

We will consider four recommender models: user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF), random recommendations (RANDOM), and popularity-based selections (POPULAR). To each of the four models we will apply six combinations of similarity (cosine and Pearson) and normalization (row centering, Z-score, none), for a total of 24 model variants.

First we build an evaluation scheme on the jokes_rm dataset, splitting it 80% training / 20% test, with 30 ratings given per test user. Jokes often have tough critics, so we will consider a rating of 5 “good”, which sits just below the third quartile (5.10).

eval_set <- evaluationScheme(data = jokes_rm,
                             method = "split",
                             train = 0.8,
                             given = 30,
                             goodRating = 5)
eval_set
## Evaluation scheme with 30 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 3875 x 100 rating matrix of class 'realRatingMatrix' with 314302 ratings.

Now that the evaluation set is created, we will create and evaluate each of the four subject models, with their six subcomponents, and plot Precision-recall and ROC curves to visually evaluate model performance.

UBCF models

ubcf_models <- list(
  ubcf_cos_null = list(name = "UBCF", param = list(method = "cosine", normalize = NULL)),
  ubcf_prs_null = list(name = "UBCF", param = list(method = "pearson", normalize = NULL)),
  
  ubcf_cos_center = list(name = "UBCF", param = list(method = "cosine", normalize = "center")),
  ubcf_prs_center = list(name = "UBCF", param = list(method = "pearson", normalize = "center")),
  
  ubcf_cos_z = list(name = "UBCF", param = list(method = "cosine", normalize = "Z-score")),
  ubcf_prs_z = list(name = "UBCF", param = list(method = "pearson", normalize = "Z-score"))
)

ubcf_eval_results <- evaluate(x = eval_set, 
                              method = ubcf_models, 
                              n = seq(10, 100, 10)
                              )
## UBCF run fold/sample [model time/prediction time]
##   1  [0.011sec/4.013sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.001sec/4.822sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.04sec/4.03sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.038sec/3.731sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.142sec/3.628sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.139sec/4.278sec]
plot(ubcf_eval_results, "prec/rec", annotate = TRUE)
title("UBCF Precision-recall")

plot(ubcf_eval_results, annotate = T) 
title("UBCF ROC curve")

IBCF Models

ibcf_models <- list(
  ibcf_cos_null = list(name = "IBCF", param = list(method = "cosine", normalize = NULL)),
  ibcf_prs_null = list(name = "IBCF", param = list(method = "pearson", normalize = NULL)),
  
  ibcf_cos_center = list(name = "IBCF", param = list(method = "cosine", normalize = "center")),
  ibcf_prs_center = list(name = "IBCF", param = list(method = "pearson", normalize = "center")),
  
  ibcf_cos_z = list(name = "IBCF", param = list(method = "cosine", normalize = "Z-score")),
  ibcf_prs_z = list(name = "IBCF", param = list(method = "pearson", normalize = "Z-score"))
)

ibcf_eval_results <- evaluate(x = eval_set, 
                              method = ibcf_models, 
                              n = seq(10, 100, 10)
                              )
## IBCF run fold/sample [model time/prediction time]
##   1  [0.211sec/0.295sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.35sec/0.22sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.217sec/0.185sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.289sec/0.2sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.298sec/0.23sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.376sec/0.265sec]
plot(ibcf_eval_results, "prec/rec", annotate = TRUE)
title("IBCF Precision-recall")

plot(ibcf_eval_results, annotate = T) 
title("IBCF ROC curve")

RANDOM and POPULAR models
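The RANDOM and POPULAR evaluations follow the same pattern as the UBCF and IBCF blocks; a sketch (neither method computes user/item similarities, so only the normalization setting matters here):

```r
# Sketch: evaluate RANDOM and POPULAR under the same evaluation scheme.
# Neither method uses a similarity measure, so only normalization varies.
rand_pop_models <- list(
  random_null    = list(name = "RANDOM",  param = list(normalize = NULL)),
  random_center  = list(name = "RANDOM",  param = list(normalize = "center")),
  popular_null   = list(name = "POPULAR", param = list(normalize = NULL)),
  popular_center = list(name = "POPULAR", param = list(normalize = "center")),
  popular_z      = list(name = "POPULAR", param = list(normalize = "Z-score"))
)

rand_pop_results <- evaluate(x = eval_set,
                             method = rand_pop_models,
                             n = seq(10, 100, 10))

plot(rand_pop_results, "prec/rec", annotate = TRUE)
title("RANDOM/POPULAR Precision-recall")

plot(rand_pop_results, annotate = TRUE)
title("RANDOM/POPULAR ROC curve")
```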

Model Selection

After looking at the Precision-recall and ROC curves for the four methods, some model variants were clearly more accurate than others. Three of the four winners used Pearson correlation and two of the four used Z-score normalization. The code chunks below provide a little more information about the selected models.

# Set the training, known and unknown sets
training_set <- getData(eval_set, "train")

known_set <- getData(eval_set, "known")

unknown_set <- getData(eval_set, "unknown")
ubcf_rec <- Recommender(data = training_set, method = "UBCF")

ibcf_rec <- Recommender(data = training_set, method = "IBCF")

popular_rec <- Recommender(data = training_set, method = "POPULAR")

random_rec <- Recommender(data = training_set, method = "RANDOM")
ubcf_model <- predict(ubcf_rec, known_set, type = "ratingMatrix")

ibcf_model <- predict(ibcf_rec, known_set, type = "ratingMatrix")

popular_model <- predict(popular_rec, known_set, type = "ratingMatrix")

random_model <- predict(random_rec, known_set, type = "ratingMatrix")

Selected models

I relied on the ROC curves over the Precision-recall curves since we have a fairly balanced dataset.

# UBCF Pearson Z-score
ubcf_prs_z_rec <- Recommender(data = training_set, method = "UBCF", parameter = list(method = "pearson", normalize = "Z-score"))

# IBCF Pearson Row Centering
ibcf_prs_c_rec <- Recommender(data = training_set, method = "IBCF", parameter = list(method = "pearson", normalize = "center"))

# POPULAR Cosine similarity Z-score
popular_cos_z_rec <- Recommender(data = training_set, method = "POPULAR", parameter = list(method = "cosine", normalize = "Z-score"))
## Available parameter (with default values):
## normalize     =  center
## aggregationRatings    =  colMeans
## aggregationPopularity     =  colSums
## verbose   =  FALSE

(POPULAR takes no similarity method; passing one seems to trigger recommenderlab’s listing of the parameters it does accept.)
# RANDOM Pearson WITHOUT normalization
random_prs_n_rec <- Recommender(data = training_set, method = "RANDOM", parameter = list(method = "pearson", normalize = NULL))

Predictions

ubcf_prs_z_model <- predict(ubcf_prs_z_rec, known_set, type = "ratingMatrix")

ibcf_prs_c_model <- predict(ibcf_prs_c_rec, known_set, type = "ratingMatrix")

popular_cos_z_model <- predict(popular_cos_z_rec, known_set, type = "ratingMatrix")

random_prs_n_model <- predict(random_prs_n_rec, known_set, type = "ratingMatrix")

Accuracy

The results below compare the default models with the variants chosen for their specific similarity and normalization settings. UBCF with Pearson similarity and Z-score normalization had the lowest error rate across all three measures.

error <- rbind(
  UBCF = calcPredictionAccuracy(ubcf_model, unknown_set),
  UBCF_prs_z = calcPredictionAccuracy(ubcf_prs_z_model, unknown_set),
  IBCF = calcPredictionAccuracy(ibcf_model, unknown_set),
  IBCF_prs_c = calcPredictionAccuracy(ibcf_prs_c_model, unknown_set),
  POPULAR = calcPredictionAccuracy(popular_model, unknown_set),
  POPULAR_cos_z = calcPredictionAccuracy(popular_cos_z_model, unknown_set),
  RANDOM = calcPredictionAccuracy(random_model, unknown_set),
  RANDOM_prs_n = calcPredictionAccuracy(random_prs_n_model, unknown_set)
)
error
##                   RMSE      MSE      MAE
## UBCF          4.425417 19.58431 3.491192
## UBCF_prs_z    4.368536 19.08411 3.406390
## IBCF          4.922449 24.23050 3.930794
## IBCF_prs_c    4.427983 19.60703 3.456311
## POPULAR       4.471660 19.99574 3.555921
## POPULAR_cos_z 4.470542 19.98575 3.519076
## RANDOM        6.365253 40.51645 4.939390
## RANDOM_prs_n  6.367879 40.54988 4.949997

Another item that could be tested is a hybrid recommender that combines one or more recommenders on a weighted basis, pairing the user/item accuracy of the collaborative models with the novelty and serendipity of the popularity and random models. I ran into datatype inconsistencies when testing the hybrid system, which recommenderlab implements as its own class. With a little more time I could have created and evaluated that as well.
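A sketch of what that hybrid could look like using recommenderlab’s HybridRecommender, with illustrative weights (the 0.6/0.3/0.1 split is an assumption, not a tuned choice):

```r
# Sketch: weighted hybrid of accuracy-oriented and serendipity-oriented models.
# The 0.6/0.3/0.1 weights are illustrative, not tuned.
hybrid_rec <- HybridRecommender(
  Recommender(training_set, method = "UBCF"),
  Recommender(training_set, method = "POPULAR"),
  Recommender(training_set, method = "RANDOM"),
  weights = c(0.6, 0.3, 0.1)
)

hybrid_pred <- predict(hybrid_rec, known_set, type = "ratings")
calcPredictionAccuracy(hybrid_pred, unknown_set)
```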

The main difference between offline and online evaluation is how accuracy is tested. Offline, as we did here, recommendations are scored against an “unknown” held-out portion of the test set, whereas online that unknown group could be a live user being given a recommendation on the spot. The system can then learn from click-through rates, which would further improve accuracy and surface even more interesting recommendations.
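As a concrete sketch of an online design, a simple A/B test could randomly route users to the incumbent and candidate recommenders and compare click-through rates; the counts below are entirely hypothetical:

```r
# Hypothetical A/B counts: clicks and impressions per experiment arm
clicks      <- c(control = 420, candidate = 468)
impressions <- c(control = 5000, candidate = 5000)

# Observed click-through rates and lift for the candidate arm
ctr <- clicks / impressions
unname(diff(ctr))  # 0.0096

# Two-sample proportion test on CTR (base R)
res <- prop.test(clicks, impressions)
res$p.value
```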

As shown here, one can put a number of models in a list, run and evaluate them all, and choose a model for production. This approach will only get easier with more computing power, but one must slow down and really think through the goals of the system and the kind of experience the end user should have.

References

recommenderlab: A Framework for Developing and Testing Recommendation Algorithms

Building a Recommendation System with R

Package ‘recommenderlab’